The language is not the point

Published: 2025-11-21

Introduction

This post is decidedly non-quantitative (of course, this is par for the course with any blog post about this topic) -- vibes only! In the future, I hope to remedy this as part of a research program.

I'm creating a programming language, primarily by driving LLM agents. The overwhelming consensus is that such an endeavor is foolhardy. Protests of "these systems will pull the wool over your eyes!" and "they can't do that" are echoing through the halls of the internet as I type these words.

But I have a cheap trick up my sleeve: I've divested myself from the outcome. I'm a 5th year PhD student with a gung ho, risky attitude. It's gotten me in trouble before, and it's not an attribute that makes for a successful and well-known researcher. I do stupid things, and I spend much too long on them. But I think I've resigned myself to my instincts, and I'd prefer to do something I want to do rather than something I have to do.

In this case, however, I believe this exercise has a point. A programming language (along with its compiler and runtime system) is a complex computational artifact, integrating several layers of careful logic. I'm a programming languages researcher, and, inspired by Terence Tao's recent meditations on using LLMs to accelerate his mathematical inquiries, I'm intensely curious: can these internet-scale amortized distributions multiply my own abilities (which are decidedly less than Tao's)?

Put less self-centeredly: can these artifacts be effectively coerced into producing sound code that satisfies a complex computational design?

If the answer is yes, and there are repeatable strategies for exercising the yes -- that's a worthwhile endeavor. This is a question about pushing these systems to their limits. As far as I can tell, OpenAI and Anthropic and Google are focusing their user-facing marketing on fucking CRUD apps. I don't give a flying fuck about CRUD apps. Stop infantilizing me with your garbage. I want a sharp fucking tool that makes my processor bleed. Can you help me craft such a tool or not?

The blind leading the blind

What I'm not telling you is that I've already spent on the order of 6-8 months using agents to do various things for me. To wit:

Claude Code was my introduction to these systems. Before Claude Code, I was a naysayer. Then I went on vacation with my family during a technical paper push, had to crank out several Bayesian inference experiments to illustrate the features of a system I was developing, and became a convert.

It was night and day for me. In the dark of night, I didn't believe that LLMs would lead to useful tools -- I didn't pay attention to them, I worked on my work, I was happy. Then, as I watched Claude Code write a "robot in a world" simulator as a probabilistic program in the probabilistic programming language we were developing, I had a bit of a spiritual awakening. I remember the days that followed only as a haze of adrenaline, frantic text messages to some of my more patient collaborators, and frenzied pseudo-crackpot discussions with my wife and family.

My collaborators and I quickly ran into the limitations of Claude Code, especially when used within an organization. Here's the central problem: every human on a team needs a theory of the codebase to work together effectively. These tools can significantly damage your theory-building abilities. If you crave automation, and you give in to your demons -- you're shooting yourself in the foot. It's very obvious that this little problem hasn't been figured out yet.

To combat this in an organizational setting, one needs to adopt a philosophy of small, modular PRs. You can drive the agent to write code, but you need to be able to consume it. Reviewing, as many internet commentators have pointed out, becomes the bottleneck -- and now doubly so, because you didn't write the code in the first place ... you're behind on your theory building.

For now, at least, greenfield solo projects are "safer".

I quickly moved on to using both Codex and Gemini via the VSCode Copilot interface. Each of these tools has a "flavor", and one develops an intuitive theory of mind for each of them (which, of course, has to change with every fucking update to the agentic harness or model).

Here are some of my abbreviated takeaways (including from more recent models), to convey some level of expertise to those who have wasted as much time as I have on these distributions:

The clear empirical answer for me is that there is value in these artifacts. But there are also obvious and significant limits as soon as you start trying to use these tools for serious business. If you take Claude Code out of the box, for instance, and start driving it to implement a language, you're quickly going to run into its weaknesses -- meaning the properties of the distribution over tokens arising from the (tuned model + agentic harness) which are adversarial to your goals, as expressed in your context window with the agent.

A conjecture about the difficulty of context engineering

How exactly does one use agents to do something complicated? I'd argue that the following are the fundamental weaknesses for agents today -- thinking about them as (tuned models + agentic harnesses):

Therefore, trivially (you didn't need me to say this) -- using agents to make something complex scales poorly with complexity. You're fighting against the complexity to get into a "golden" part of the agent's distribution, and it becomes harder and harder to do this as the codebase or task becomes more complex.

Every new agentic IDE is attempting to tame this problem. Whether it's Kiro's "spec-driven development", or Antigravity's agent manager mode, or the internally hidden logic of Claude Code ... the "features" of these IDEs are attempts to confront this issue.

And then you see a video about using such an IDE to make a fucking webpage, and you mentally prepare to deallocate yourself, harakiri style, to combat the dishonor of cooking a thousand GPUs for that. Can we be serious?

Disdain aside, the teams behind these systems have made the observation above, and it's a true one: it's what we need to contend with if we want to get these things to summon a complex computational artifact.

Summoning

I've been experimenting with multi-agent orchestration patterns myself. Locally sourced, small batch terminal orchestration ... by hand.

Worse, it seems to be working ... at least, it seems to have increased the complexity limit on the computational artifact that I can effectively get agents to work on.

I no longer use a single agent; I always use combinations of agents. Why? Because it seems (we're in vibe land now!) like things go better when I keep agents in specific roles, and play them off against each other.

Somewhere ~40% into the context window for certain roles, magic starts to happen. Red-team Codex starts nailing Claude to the wall using our spec as a crucifix. Claude squeals uncontrollably: "You're absolutely right! Please, father, let me try again?" This goes on, and work gets done -- syntax gets parsed correctly, lowered, executed. Holes are filled -- holes that only Codex can see (for I am blind, I have willingly blinded myself so that I cannot bear witness to my own sins).

Now, such debauchery can go off the rails: that's why I need to keep Codex honest -- "what is the discrepancy between our implementation and the spec?" or "can we really handle this feature?" My prodding often finds gaps in understanding: if the gaps are too large, I must delete entire sequences of commits and try again. But code is cheap at this party, and flows like wine. Deletion, followed by a resampling ... is often more effective than incremental change. (Remember, greenfield I said!)
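To make the shape of this loop concrete, here's a minimal sketch -- not my actual tooling. Everything in it is a stand-in: `builder-agent` and `redteam-agent` are hypothetical CLI commands (in reality this is me, by hand, across terminal sessions), `SPEC.md` and the `last-known-good` ref are placeholders, and the automated stopping check is a poor proxy for human judgment.

```python
import subprocess

# Hypothetical: the spec document that anchors both roles.
SPEC = open("SPEC.md").read()


def run_agent(cmd: list[str], prompt: str) -> str:
    """Pipe a prompt to an agent CLI and return its reply.

    Placeholder: substitute whatever invocation your agent tooling
    actually accepts. The real workflow is interactive terminals.
    """
    result = subprocess.run(cmd, input=prompt, capture_output=True, text=True)
    return result.stdout


def git(*args: str) -> None:
    subprocess.run(["git", *args], check=True)


def one_round(task: str) -> bool:
    # Builder role: writes code against the spec, nothing more.
    run_agent(["builder-agent"], f"Implement: {task}\n\nSpec:\n{SPEC}")
    git("add", "-A")
    git("commit", "-m", f"wip: {task}")

    # Red-team role: audits the work against the spec, nothing more.
    report = run_agent(
        ["redteam-agent"],
        "What is the discrepancy between our implementation and the spec? "
        f"Can we really handle this feature?\n\nSpec:\n{SPEC}",
    )
    print(report)
    # A stand-in for the human reading the report and deciding.
    return "no discrepancy" in report.lower()


# Deletion + resampling: if the gaps are too large, throw away the
# wip commits and redraw from the distribution rather than patching.
if not one_round("parse, lower, and execute the new syntax"):
    git("reset", "--hard", "last-known-good")  # hypothetical ref
```

The point is the structure, not the automation: one role writes, a different role audits against the spec, and the human arbitrates -- keeping, iterating, or deleting and resampling.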

There's a visceral experience of finding a "golden context stream" through the system -- where you've got your documents in order, the randomness has aligned correctly, the attention heads are absolutely beaming, and you can tell that the silicon is locked in... they say that hope was the last thing remaining in Pandora's box.

I can't say with certainty today if one can engineer anything serious with this type of alchemy. It is, without question, a form of gambling -- but, as with Monte Carlo methods, I'm holding out hope that some of my investigations will lead to design insights for controlling the randomness in a way that is significantly more repeatable than my depraved ramblings letting Codex red-team Claude.

What I've found so far is that, while multi-agent strategies tend to keep the overall system (codebase and agents) aligned with the specifications that I work on, the specifications can end up describing an artifact which is inconsistent or flawed. This is not an agent problem! This is a me problem!

Unfortunately, no agent has accelerated my thinking at this level of work. The closest has been ChatGPT's "research mode" with GPT 5.1 -- but ultimately, I need to be responsible for carefully understanding whether the features of my design are going to cohere together. But this, for me, is definitely the fun and imaginative part.

Samten bardo

If you work in this way for long enough, the agents and their roles become a new sort of digital appendage: fuzzy feelers that let you grope around in darkness, illuminating different parts of a Cthulhian elephant that you've brought into being. It's all fuzzy -- and it's giving "perfectly stable to bet a large portion of the economy on". Eventually, you have to cut your eyelids open and reach your hands into the filth. Somewhere in the mass of stochastic tentacles lies a perfect codebase -- like dumb monkeys, we try to sample Shakespeare.

With that, I'll leave you with a final note: there's something bizarre that I've experienced, where failure of these multi-agent context engineering strategies indicates an issue in my design. This feels quite strange, and it involves that "intuitive theory of mind" business: one becomes accustomed to agents solving certain categories of tasks, so when they can't solve something, you end up looking closely and realizing that your design has a flaw. This experience, perhaps, has been the most surprising of all.

